Consumer Complaint Database

In this article, we use an IMDB movie dataset from Kaggle.com. We will develop a model for recommending similar movies to a given movie.

Preprocessing

Removing unnecessary columns

Note that

The following three columns are unnecessary for our study. Thus, we are going to remove these columns.

These columns are now dropped from our data.
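The column-removal step can be sketched with pandas as follows. The text does not name the three columns, so the names in `unnecessary` are hypothetical placeholders on a toy frame, not the columns the article actually removed.

```python
import pandas as pd

# Toy frame; every column name here is an assumed stand-in.
df = pd.DataFrame({
    "Movie_Title": ["A", "B"],
    "Imdb_Score": [7.1, 6.4],
    "Movie_Imdb_Link": ["http://example.com/a", "http://example.com/b"],
    "Aspect_Ratio": [1.85, 2.35],
    "Facenumber_In_Poster": [0, 2],
})

# Hypothetical list of columns judged unnecessary for the study.
unnecessary = ["Movie_Imdb_Link", "Aspect_Ratio", "Facenumber_In_Poster"]

# drop(columns=...) returns a new frame without the listed columns.
df = df.drop(columns=unnecessary)
print(df.columns.tolist())
```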

Duplicated Values

First off, note that

These values are duplicated!

Removing the duplicated movies.
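A minimal sketch of finding and removing duplicated rows, assuming a duplicate means a row that repeats an earlier row in every column. The toy data and column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Movie_Title": ["A", "B", "A"],
    "Title_Year": [2001, 2005, 2001],
    "Imdb_Score": [7.1, 6.4, 7.1],
})

# Rows that exactly repeat an earlier row.
dupes = df[df.duplicated(keep="first")]
print(len(dupes), "duplicated row(s)")

# Keep the first occurrence of each movie and renumber the index.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
```

If duplicates should be judged on the title alone, `subset=["Movie_Title"]` can be passed to both calls; whether that is appropriate depends on the data.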

Missing values

Director_Name

We can simply drop the rows without a director_name.
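Dropping rows with a missing director could look like the sketch below. The column name `Director_Name` follows the heading above, and the data is a toy example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Movie_Title": ["A", "B", "C"],
    "Director_Name": ["X", np.nan, "Y"],
})

# Drop only the rows where Director_Name is missing.
df = df.dropna(subset=["Director_Name"]).reset_index(drop=True)
```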

Column: Budget

We can use sklearn.impute.SimpleImputer, an imputation transformer for completing missing values, to fill in the missing budgets.
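A small sketch of imputing the budget with SimpleImputer. The text does not say which strategy was used; the median here is an assumption, chosen because budgets tend to be skewed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Budget": [10.0, np.nan, 30.0, 50.0]})

# strategy="median" is an assumed choice, not taken from the article.
imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# fit_transform expects a 2-D input and returns a 2-D array,
# so we take the single column back out.
df["Budget"] = imputer.fit_transform(df[["Budget"]])[:, 0]
```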

Column: Color

For the movies' color, we have

We can assume that these movies are all in color, since the earliest movie on this list is from 1990. Thus, we treat the missing values as color.
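Filling the missing color entries might be sketched as below. The label `"Color"` is an assumed category value; the real dataset may spell its categories differently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Title_Year": [1990, 2004, 1946],
    "Color": [np.nan, "Color", "Black and White"],
})

# Inspect the years of the movies with no color information.
print(df.loc[df["Color"].isna(), "Title_Year"].min())

# Assumed label for color movies.
df["Color"] = df["Color"].fillna("Color")
```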

Column: Language

As for the Language, we have,

We can see undesired values such as NaN and None. First, let's deal with None. We have

We can assume that these movies have been produced in English.

Since most movies are from the USA, we can likewise assume that the movies with a missing language have been produced in English. Therefore, we set these values to English.
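One way to sketch the language cleanup is below. It assumes that None appears as the literal string `"None"` and that NaN is an ordinary missing value; both are replaced by `"English"`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Language": ["English", "None", np.nan, "French"]})

lang = df["Language"]

# Flag both kinds of undesired values: true NaN and the string "None".
undesired = lang.isna() | (lang == "None")

# mask() replaces the flagged entries and leaves the rest untouched.
df["Language"] = lang.mask(undesired, "English")
```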

Column: Duration

We can see that the Duration of some movies is missing.

There is nothing that can be done regarding these movies and we are going to simply drop them.

Drop these data as well.

Column: Country

As for Country, there is only one movie without a country name.

We can search the data for the actors' names and their other movies.

This movie is made in the USA and we can replace NaN with the USA.
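The lookup described above can be sketched like this: find the movie with no country, list other movies with the same lead actor, then fill the gap. The names and data are made up, and `Actor_1_Name` is an assumed column name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Movie_Title": ["A", "B", "C"],
    "Actor_1_Name": ["Pat Doe", "Pat Doe", "Sam Roe"],
    "Country": ["USA", np.nan, "UK"],
})

missing = df["Country"].isna()

# Lead actors of the movies that lack a country.
actors = df.loc[missing, "Actor_1_Name"]

# Other movies featuring those actors, used as a hint for the country.
others = df[~missing & df["Actor_1_Name"].isin(actors)]
print(others[["Movie_Title", "Country"]])

# After checking manually, fill the missing value.
df.loc[missing, "Country"] = "USA"
```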

Column: Content_Rating

For movie ratings, note that the ratings used since 1996 are the following (source):

Rating   Description
G        General audiences – All ages admitted.
PG       Parental guidance suggested – Some material may not be suitable for children.
PG-13    Parents strongly cautioned – Some material may be inappropriate for children under 13.
R        Restricted – Under 17 requires accompanying parent or adult guardian.
NC-17    No one 17 and under admitted.

Thus, a standard list of ratings can be found as

However,

We need to convert

To the standard form. We can convert these values using the following table.

Standard Format   Data Format
PG                G, TV-G, TV-PG, GP
R                 M
Unrated           NaN, Not Rated
Approved          Passed
PG-13             TV-14
NC-17             X
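The conversion table above can be expressed as a dictionary and applied with `Series.replace`. This is a sketch on toy data; the dictionary name `rating_map` is ours, and the mapping simply transcribes the table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Content_Rating": ["G", "M", np.nan, "Passed", "TV-14", "X", "R", "Not Rated"],
})

# Data format -> standard format, transcribed from the table.
rating_map = {
    "G": "PG", "TV-G": "PG", "TV-PG": "PG", "GP": "PG",
    "M": "R",
    "Not Rated": "Unrated",
    "Passed": "Approved",
    "TV-14": "PG-13",
    "X": "NC-17",
}

# Map the non-standard labels, then treat missing ratings as Unrated.
df["Content_Rating"] = df["Content_Rating"].replace(rating_map).fillna("Unrated")
```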

Columns containing 'Actor'

As for actors,

Replacing these values with 'None'.
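A sketch of filling the missing actor names. It assumes the name columns contain 'Actor' and end in 'Name'; the extra suffix check keeps numeric actor columns, such as like counts, from receiving the string.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Actor_1_Name": ["Pat Doe", np.nan],
    "Actor_2_Name": [np.nan, "Sam Roe"],
    "Actor_1_Facebook_Likes": [100.0, np.nan],
})

# Assumed naming pattern for the actor name columns.
name_cols = [c for c in df.columns if "Actor" in c and c.endswith("Name")]

# Missing names become the placeholder string 'None'.
df[name_cols] = df[name_cols].fillna("None")
```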

Columns containing 'Likes'

We can replace them with zero.
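Replacing missing like counts with zero could be sketched as follows, again on toy data with assumed column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Movie_Title": ["A", "B"],
    "Director_Facebook_Likes": [np.nan, 12.0],
    "Actor_1_Facebook_Likes": [100.0, np.nan],
})

# Every column whose name contains 'Likes'.
like_cols = [c for c in df.columns if "Likes" in c]

# A missing like count is treated as zero likes.
df[like_cols] = df[like_cols].fillna(0)
```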

Remaining NaN values

Column: Gross

First, let's look at the correlation plot for our data.

and

Let's do a further test.

Using the predicted values for gross instead of NaN values.
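The article does not state which model produced the predictions. As one possible sketch, a linear regression can be fitted on the rows where gross is known and used to fill the rest. The predictor columns and the toy numbers below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data in which Gross = 2 * Budget + 3 * Num_Voted_Users.
df = pd.DataFrame({
    "Budget": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Num_Voted_Users": [2.0, 1.0, 4.0, 3.0, 6.0],
    "Gross": [8.0, 7.0, 18.0, np.nan, 28.0],
})

# Correlation of each numeric column with Gross (NaN rows are skipped).
print(df.corr()["Gross"])

features = ["Budget", "Num_Voted_Users"]  # assumed predictors
known = df["Gross"].notna()

# Fit on rows with a known gross, then predict the missing ones.
model = LinearRegression().fit(df.loc[known, features], df.loc[known, "Gross"])
df.loc[~known, "Gross"] = model.predict(df.loc[~known, features])
```

On real data the fit should be checked on a held-out set before the predictions are trusted as replacements.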

Saving CSV file


References

  1. Kaggle Dataset: Movie Metadata